Abstract

In information extraction, we often wish to identify
all mentions of an entity, such as a person or
organization. Traditionally, a group of words is
labeled as an entity based only on local information.
But information from throughout a document can be
useful; for example, if the same word is used multiple
times, it is likely to have the same label each time.
We present a CRF that explicitly represents
dependencies between the labels of pairs of similar
words in a document. On a standard information
extraction data set, we show that learning these
dependencies leads to a 13.7% reduction in error on
the field that had caused the most repetition errors.

1  Introduction 

Most natural-language systems solve many problem
instances in their lifetime. Traditionally, these
instances are solved separately, that is, the data is
assumed to be independent and identically distributed.
But often this assumption does not hold. There is much
current interest in collectively making related
classification decisions, using dependencies between
decisions to increase performance. In Web page
classification, for example, Taskar et al. (2002)
model the fact that pages that hyperlink to each other
are more likely to have the same type, resulting in
increased classification performance.

Within information extraction, one important type of
error occurs on repeated mentions of the same field.
For example, we often wish to identify all mentions of
an entity, such as a person or organization, because
each mention might contain different useful
information. Furthermore, these mentions tend to use
similar terms. We can take advantage of this fact by
favoring labelings that treat repeated words
identically, and by combining features from all
occurrences so that the extraction decision can be made
based on global information. However, most extraction
systems, whether probabilistic or not, do not take
advantage of this dependency, instead treating the
separate mentions independently.

Recently, Bunescu and Mooney (2004) used a relational
Markov network (Taskar et al., 2002) to collectively
classify the mentions in a document, achieving
increased accuracy by learning dependencies between
similar mentions. In their work, however, candidate
phrases are extracted heuristically, which can
introduce errors if a true entity is not selected as a
candidate phrase. Ideally, we would like to perform
collective segmentation and labeling simultaneously, so
that the system can take into account dependencies
between the two tasks. This can be done naturally using
probabilistic sequence models.

Traditional probabilistic sequence models, such as
HMMs, are generative, in the sense that they represent
a joint probability distribution p(y, x). Because this
includes a distribution p(x) over the input features,
it is difficult to use arbitrary, overlapping features
while maintaining tractability. A solution to this
problem is to model instead the conditional
distribution p(y|x), which is all that is needed for
classification anyway. Because the model is
conditional, dependencies among the features in x do
not need to be explicitly represented. Popular
conditional models include maximum entropy classifiers
(Berger et al., 1996) and conditional random fields
(Lafferty et al., 2001). Conditionally-trained models
have been shown to perform better than
generatively-trained models on many tasks, including
document classification (Taskar et al., 2002),
part-of-speech tagging (Ratnaparkhi, 1996), extraction
of data from tables (Pinto et al., 2003), segmentation
of FAQ lists (McCallum et al., 2000), and noun-phrase
segmentation (Sha and Pereira, 2003).

To perform collective labeling, we need to represent
dependencies between distant terms in the input. But
this reveals a general limitation of sequence models,
whether generatively or conditionally trained. Sequence
models usually make a Markov assumption among labels,
that is, that any label $y_t$ is independent of all
previous labels given its immediate predecessors
$y_{t-k}, \ldots, y_{t-1}$. That is, such models
represent dependence only between nearby nodes (for
example, between bigrams and trigrams) and cannot
represent the higher-order dependencies that arise when
identical words occur throughout a document.

In this paper we present a conditional model that
collectively segments a document into mentions and
classifies the mentions by entity type, taking into
account probabilistic dependencies between distant
mentions. Although n-gram sequence models cannot
represent long-distance dependencies, more general
graphical models can, by adding edges between the
labels of similar words. We introduce the skip-chain
CRF, which is a CRF whose structure is a linear chain
with additional connections between all pairs of
similar words, as shown in Figure 2.

Even though the limitations of n-gram models have been
widely recognized within natural language processing,
long-distance dependencies are difficult to represent
in generative models, because full n-gram models have
too many parameters if n is large. We avoid this
problem by selecting which skip edges to include based
on the input string.


Figure  1:  Graphical  representation  of  linear-chain 
CRF.  Although  the  hidden  nodes  can  depend  on 
observations  at  any  time  step,  for  clarity  we  have 
shown  links  only  to  observations  at  the  same  time 
step. 


Figure  2:  Graphical  representation  of  skip-chain 
CRF.  Identical  words  are  connected  because  they 
are  likely  to  have  the  same  label. 

This kind of input-specific dependence is difficult to
represent in a generative model, which needs to
generate the input. In other words, conditional models
have been popular because of their flexibility in
allowing overlapping features; in this paper, we take
advantage of their flexibility in allowing
input-specific model structure.

Because our model contains many overlapping loops,
exact inference is intractable. For some of our
documents, a single probability table in the junction
tree would require over 4 GB of memory. Instead, we
perform approximate inference using a schedule for
loopy belief propagation called tree
reparameterization (TRP) (Wainwright et al., 2001).

We evaluate our model on a standard information
extraction data set of e-mail seminar announcements.
In the field type on which a linear-chain CRF had the
most repetition errors, which is speaker name, we show
that learning these dependencies with a skip-chain CRF
leads to a 13.7% reduction in error. Failure analysis
confirms that the reduction in error is due to
increased recall on the repeated field mentions.


2  Linear-chain  CRFs 


Conditional random fields (CRFs) (Lafferty et al.,
2001) are undirected graphical models that encode a
conditional probability distribution using a given set
of features. CRFs are defined as follows. Let $G$ be an
undirected model over sets of random variables
$\mathbf{y}$ and $\mathbf{x}$. As a typical special
case, $\mathbf{y} = \{y_t\}$ and $\mathbf{x} = \{x_t\}$
for $t = 1, \ldots, T$, so that $\mathbf{y}$ is a
labeling of an observed sequence $\mathbf{x}$.

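If $C = \{\{\mathbf{y}_c, \mathbf{x}_c\}\}$ is the set
of cliques in $G$, then CRFs define the conditional
probability of a state sequence given the observed
sequence as:

$$p_\Lambda(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{c \in C} \Phi(\mathbf{y}_c, \mathbf{x}_c) \tag{1}$$
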


where $\Phi$ is a potential function and the partition
function
$Z(\mathbf{x}) = \sum_{\mathbf{y}} \prod_{c \in C} \Phi(\mathbf{y}_c, \mathbf{x}_c)$
is a normalization factor over all state sequences for
the sequence $\mathbf{x}$. We assume the potentials
factorize according to a set of features $\{f_k\}$,
which are given and fixed, so that

$$\Phi(\mathbf{y}_c, \mathbf{x}_c) = \exp\Bigl( \sum_k \lambda_k f_k(\mathbf{y}_c, \mathbf{x}_c) \Bigr) \tag{2}$$

The model parameters are a set of real weights
$\Lambda = \{\lambda_k\}$, one weight for each feature.

Most applications use the linear-chain CRF, in which a
first-order Markov assumption is made on the hidden
variables. A graphical model for this is shown in
Figure 1. In this case, the cliques of the conditional
model are the nodes and edges, so that there are
feature functions $f_k(y_t, y_{t+1}, \mathbf{x}, t)$
for each label transition. (Here we write the feature
functions as potentially depending on the entire input
sequence.) Feature functions can be arbitrary functions
of their arguments. For example, a feature function
$f_k(y_t, y_{t+1}, \mathbf{x}, t)$ could be a binary
test that has value 1 if and only if $y_t$ has the
label Other, $y_{t+1}$ has the label Speaker, and $x_t$
begins with a capital letter.
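
The binary feature described in this example can be written down directly. The following is a minimal sketch, not the authors' implementation; the function name and the token list are hypothetical, and "begins with a capital letter" is interpreted as the first character being uppercase.

```python
from typing import Sequence


def f_other_to_speaker_capitalized(
    y_t: str, y_next: str, x: Sequence[str], t: int
) -> int:
    """Binary feature on a label transition.

    Returns 1 iff y_t is Other, y_{t+1} is Speaker, and the word x_t
    begins with a capital letter; otherwise 0.
    """
    return int(y_t == "Other" and y_next == "Speaker" and x[t][:1].isupper())


# Hypothetical tokenized input, for illustration only.
tokens = ["Speaker:", "Robert", "Booth", "will", "talk"]
print(f_other_to_speaker_capitalized("Other", "Speaker", tokens, 0))
print(f_other_to_speaker_capitalized("Other", "Speaker", tokens, 3))
```

The feature takes the whole input sequence as an argument even though it inspects only position t, mirroring the remark that feature functions may depend on the entire input.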


3  Skip-chain  CRFs 

3.1  Model 

Linear-chain CRFs cannot represent dependencies between
distant occurrences of similar words. In this paper, we
extend linear-chain CRFs by adding probabilistic
connections between similar words, that is, adding
edges between them in an undirected linear-chain model
such as Figure 2. We call these additional edges skip
edges. Skip edges can represent dependence between
distant nodes, so that similar words can be labeled
similarly. Also, the features on skip edges can
incorporate information from the context of both
endpoints, so that if the label of one endpoint is more
certain, it can influence the label of the other.

An important modeling choice is which skip edges to
include. We may choose simply to connect identical
words, but more generally we could connect any pair of
words that we believe to be similar, for example, pairs
of words that belong to the same stem class, or have
small edit distance. Of course, if we simply connected
all possible words, inference in the model would
require summing over all possible state sequences; no
efficient dynamic programming algorithm would be
available. So we need to use similarity metrics that
result in a sufficiently sparse graph. In this paper,
we focus on named-entity recognition, so we connect
pairs of identical capitalized words.
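
The edge-selection rule used here (connect pairs of identical capitalized words) can be sketched as below. This is an illustrative sketch under stated assumptions: the function name is hypothetical, the input is assumed to be already tokenized, and "capitalized" is taken to mean that the first character is uppercase.

```python
from collections import defaultdict
from itertools import combinations


def skip_edges(tokens):
    """Return sorted (u, v) position pairs, u < v, linking identical capitalized words."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():
            positions[tok].append(i)
    edges = []
    for idxs in positions.values():
        # Every pair of occurrences of the same word gets one skip edge.
        edges.extend(combinations(idxs, 2))
    return sorted(edges)


# Hypothetical tokenized document, for illustration only.
doc = "Speaker : Robert Booth . Booth is manager . Meet Booth".split()
print(skip_edges(doc))
```

A word that occurs n times contributes n(n-1)/2 edges under this rule, which is why the similarity criterion has to be restrictive enough to keep the graph sparse.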

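This model can be formally defined by adding a second
type of potential to the linear-chain model. For an
instance $\mathbf{x}$, let $\mathcal{I} = \{(u, v)\}$
be the set of all pairs of sequence positions for which
there are skip edges. Then we can write the probability
of a label sequence $\mathbf{y}$ given an input
$\mathbf{x}$ and parameters $\Lambda$ as

$$p_\Lambda(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t} \Phi(y_t, y_{t+1}, \mathbf{x}, t) \prod_{(u,v) \in \mathcal{I}} \Psi(y_u, y_v, \mathbf{x}, u, v), \tag{3}$$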

where $\Phi$ is the potential over linear-chain edges,
and $\Psi$ is the potential over skip edges. We assume
that each of the potentials factorizes according to a
set of features so that

$$\log \Phi(y_t, y_{t+1}, \mathbf{x}, t) = \sum_k \lambda_k f_k(y_t, y_{t+1}, \mathbf{x}, t) \tag{4}$$

$$\log \Psi(y_u, y_v, \mathbf{x}, u, v) = \sum_k \lambda'_k f'_k(y_u, y_v, \mathbf{x}, u, v). \tag{5}$$



Note  that  each  type  of  clique  has  its  own  set  of 
features  and  weights. 

For the short-distance edges, we factorize our features
as

$$f_k(y_t, y_{t+1}, \mathbf{x}, t) = p_k(y_t, y_{t+1})\, q_k(\mathbf{x}, t) \tag{6}$$

where $p_k(y_t, y_{t+1})$ is a binary function on the
assignment, and $q_k(\mathbf{x}, t)$ is a function
solely of the input string, which we call an input
feature. In general $q_k(\mathbf{x}, t)$ can depend on
arbitrary positions of the input string. For example, a
useful feature for NER is $q_k(\mathbf{x}, t) = 1$ if
and only if $x_{t+1}$ is a capitalized word.

For the skip edges, we factorize our features in a way
that is similar but allows us to combine distant
features in the sequence. More specifically, we
factorize the features for the skip edges as

$$f'_k(y_u, y_v, \mathbf{x}, u, v) = p'_k(y_u, y_v, u, v)\, q'_k(\mathbf{x}, u, v) \tag{7}$$

The input features $q'_k(\mathbf{x}, u, v)$ can combine
information from the neighborhood of $y_u$ and $y_v$.
For example, one useful feature is
$q'_k(\mathbf{x}, u, v) = 1$ if and only if
$x_u = x_v =$ “Booth” and $x_{v-1} =$ “Speaker:”. This
can be a useful feature when the context around $x_u$,
such as “Robert Booth is manager of control
engineering...,” does not make clear whether or not
Robert Booth is presenting a talk, but the context
around $x_v$ is clear, such as “Speaker: Robert
Booth.”¹

3.2  Inference 

The inference problem is, given an input string
$\mathbf{x}$, to compute marginal distributions
$p(y_t \mid \mathbf{x})$ or the most likely (Viterbi)
labeling $\arg\max_{\mathbf{y}} p(\mathbf{y} \mid \mathbf{x})$.
Computing marginals is needed for parameter estimation,
and the Viterbi labeling is used to label a new
sentence. In linear-chain CRFs, exact inference can be
performed efficiently by variants of the standard
dynamic-programming algorithms for HMMs.

In skip-chain CRFs, inference is much more difficult.
Because the loops can be long and overlapping, exact
inference is intractable for the data we consider in
the next section. Exact inference requires time
exponential in the size of the largest clique in a
graph's junction tree. For the seminars data, 29 of the
485 instances have a maximum clique size of 10 or
greater, and 11 have a maximum clique size of 14 or
greater. (The worst instance has a clique with 61
nodes.) For reference, representing a single potential
over 14 nodes requires more memory than can be
addressed in a 32-bit architecture. Thus, exact
inference is not practical for skip-chain CRFs.

¹ This example is taken from an actual error made by a
linear-chain CRF on the seminars data set. We present
results from this data set in Section 4.
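
The memory claim can be checked with a short calculation. The sketch below assumes five labels per token (the four fields plus an "other" label); the text does not state the label count, so NUM_LABELS is an assumption made only for illustration.

```python
# Assumption: 5 labels (stime, etime, location, speaker, other).
NUM_LABELS = 5
CLIQUE_SIZE = 14

# A potential over 14 nodes has one entry per joint label assignment.
entries = NUM_LABELS ** CLIQUE_SIZE

# A 32-bit architecture can address at most 2**32 bytes.
addressable_bytes = 2 ** 32

print(entries)
print(addressable_bytes)
print(entries > addressable_bytes)
```

Under this assumption the table has more entries than a 32-bit machine has addressable bytes, so it would not fit even at one byte per entry.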

Instead, we perform approximate inference using the
loopy belief propagation algorithm. Loopy belief
propagation is an iterative method that is not
guaranteed to converge, but has been found to be
reasonably accurate in practice (Murphy et al., 1999).
We use an asynchronous tree-based schedule known as
TRP (Wainwright et al., 2001).

Loopy belief propagation is a generalization of the
forward-backward algorithm for HMMs and linear-chain
CRFs, a fact which provides intuition about the
skip-chain CRF model. In the forward-backward
algorithm, probabilistic information flows only between
neighboring nodes, via the $\alpha$ and $\beta$
recursions. In a skip-chain CRF, belief propagation
passes messages not only between neighboring nodes, as
in forward-backward, but also along the long-distance
edges, propagating probabilistic information from
distant sequence locations.
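
To make the message-passing picture concrete, here is a minimal sketch of sum-product loopy belief propagation on a pairwise model. It uses a simple synchronous (flooding) schedule rather than the TRP schedule used in this paper, and all potentials and names are made up for illustration. A brute-force routine is included so that the approximation can be compared with exact marginals on a tiny graph.

```python
from itertools import product
from math import prod


def loopy_bp(n, K, node_pot, edge_pot, iters=50):
    """Approximate node marginals by synchronous sum-product loopy BP."""
    nbrs = {i: [] for i in range(n)}
    pot = {}
    for (u, v), table in edge_pot.items():
        nbrs[u].append(v)
        nbrs[v].append(u)
        pot[(u, v)] = table  # indexed [y_u][y_v]
        pot[(v, u)] = [[table[a][b] for a in range(K)] for b in range(K)]
    msg = {(u, v): [1.0 / K] * K for u in nbrs for v in nbrs[u]}
    for _ in range(iters):
        new = {}
        for (u, v) in msg:
            # Message u -> v: sum out y_u, using all messages into u except v's.
            out = [
                sum(
                    node_pot[u][yu] * pot[(u, v)][yu][yv]
                    * prod(msg[(w, u)][yu] for w in nbrs[u] if w != v)
                    for yu in range(K)
                )
                for yv in range(K)
            ]
            z = sum(out)
            new[(u, v)] = [o / z for o in out]
        msg = new
    beliefs = []
    for i in range(n):
        b = [node_pot[i][y] * prod(msg[(w, i)][y] for w in nbrs[i]) for y in range(K)]
        beliefs.append([p / sum(b) for p in b])
    return beliefs


def exact_marginals(n, K, node_pot, edge_pot):
    """Exact node marginals by enumerating all K**n labelings (tiny graphs only)."""
    marg = [[0.0] * K for _ in range(n)]
    for y in product(range(K), repeat=n):
        w = prod(node_pot[i][y[i]] for i in range(n))
        w *= prod(t[y[u]][y[v]] for (u, v), t in edge_pot.items())
        for i in range(n):
            marg[i][y[i]] += w
    return [[p / sum(row) for p in row] for row in marg]


# Toy example: 4 positions, 2 labels; only position 3 has informative evidence.
same = [[1.2, 1.0], [1.0, 1.2]]   # weak agreement on chain edges
skip = [[3.0, 1.0], [1.0, 3.0]]   # strong agreement on the skip edge
node_pot = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.1, 0.9]]
chain = {(0, 1): same, (1, 2): same, (2, 3): same}
skip_chain = dict(chain)
skip_chain[(0, 3)] = skip

print(loopy_bp(4, 2, node_pot, chain)[0])            # chain only
print(loopy_bp(4, 2, node_pot, skip_chain)[0])       # with skip edge (0, 3)
print(exact_marginals(4, 2, node_pot, skip_chain)[0])
```

Without the skip edge the graph is a tree and the beliefs coincide with the exact marginals, as in forward-backward. With the skip edge, the evidence at position 3 reaches position 0 directly, which is the effect described above.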

3.3  Parameter  Estimation 

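Given training data
$\mathcal{D} = \{\mathbf{x}^{(i)}, \mathbf{y}^{(i)}\}_{i=1}^{N}$,
we estimate model parameters $\Lambda = \{\lambda_k\}$
by maximum a posteriori (MAP) estimation. We maximize
the log posterior probability

$$\log p(\Lambda \mid \mathcal{D}) \propto \log p(\mathcal{D} \mid \Lambda) + \log p(\Lambda) \tag{8}$$
$$\phantom{\log p(\Lambda \mid \mathcal{D})} = \mathcal{L}(\Lambda) + \log p(\Lambda), \tag{9}$$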

where $\mathcal{L}(\Lambda)$ is the log likelihood

$$\mathcal{L}(\Lambda) = \sum_i \log p_\Lambda(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}). \tag{10}$$

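The derivative of the likelihood with respect to one of
the short-distance parameters $\lambda_k$ is

$$\frac{\partial \mathcal{L}}{\partial \lambda_k} = \sum_i \Bigl( C_k(\mathbf{y}^{(i)}, \mathbf{x}^{(i)}) - E_k(\mathbf{y}^{(i)}, \mathbf{x}^{(i)}) \Bigr) \tag{11}$$

where $C_k$ and $E_k$ are constraints and expectations
given by

$$C_k(\mathbf{y}^{(i)}, \mathbf{x}^{(i)}) = \sum_t f_k(y_t^{(i)}, y_{t+1}^{(i)}, \mathbf{x}^{(i)}, t) \tag{12}$$

$$E_k(\mathbf{y}^{(i)}, \mathbf{x}^{(i)}) = \sum_t \sum_{y_t, y_{t+1}} p_\Lambda(y_t, y_{t+1} \mid \mathbf{x}^{(i)})\, f_k(y_t, y_{t+1}, \mathbf{x}^{(i)}, t) \tag{13}$$

The derivative $\partial \mathcal{L} / \partial \lambda'_k$
with respect to the long-distance parameters is
similar, with $\lambda'_k$ replacing $\lambda_k$ and
$f'_k$ replacing $f_k$.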

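To reduce overfitting, we define a prior $p(\Lambda)$
over parameters. We use a spherical Gaussian prior with
mean $\mu = 0$ and covariance matrix
$\Sigma = \sigma^2 I$, so that the gradient becomes

$$\frac{\partial \log p(\Lambda \mid \mathcal{D})}{\partial \lambda_k} = \frac{\partial \mathcal{L}}{\partial \lambda_k} - \frac{\lambda_k}{\sigma^2}.$$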

See  Peng  and  McCallum  (2004)  for  a  comparison 
of  different  priors  for  linear-chain  CRFs. 

The function $\log p(\Lambda \mid \mathcal{D})$ is
concave, and can be maximized by any number of
techniques, as in other maximum-entropy models
(Lafferty et al., 2001; Berger et al., 1996). In the
results below, we use L-BFGS, which has previously
outperformed other optimization algorithms for
linear-chain CRFs (Sha and Pereira, 2003; Malouf,
2002).

4  Results 

We  evaluate  skip-chain  CRFs  on  a  collection 
of  485  e-mail  messages  announcing  seminars  at 
Carnegie  Mellon  University.  The  messages  are 
annotated  with  the  seminar’s  starting  time,  ending 
time,  location,  and  speaker.  This  data  set  is  due  to 
Dayne  Freitag  (Freitag,  1998),  and  has  been  used 
in  much  previous  work. 

Often the fields are listed multiple times in the
message. For example, the speaker name might be
included both near the beginning and later on, in a
sentence like “If you would like to meet with Professor
Smith...” It can be useful to find both such mentions,
because different information can be in the
surrounding context of each mention: for example, the
first mention might be near an institution affiliation,
while the second mentions that Smith is a professor.


w_t = w
w_t matches [A-Z][a-z]+
w_t matches [A-Z][A-Z]+
w_t matches [A-Z]
w_t matches [A-Z]+
w_t matches [A-Z]+[a-z]+[A-Z]+[a-z]
w_t appears in list of first names, last names, honorifics, etc.
w_t appears to be part of a time followed by a dash
w_t appears to be part of a time preceded by a dash
w_t appears to be part of a date
T_t = T
q_k(x, t + δ) for all k and δ ∈ [−4, 4]

Table 1: Input features q_k(x, t) for the seminars
data. In the above, w_t is the word at position t, T_t
is the POS tag at position t, w ranges over all words
in the training data, and T ranges over all
part-of-speech tags returned by the Brill tagger. The
“appears to be” features are based on hand-designed
regular expressions that can span several tokens.
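
The word-shape rows of Table 1 and the ±4 window can be sketched with Python's `re` module. This is an assumption-laden illustration, not the feature extractor used in the experiments: the feature names are invented, each pattern is assumed to match the whole token, and the lexicon, time, date, and POS-tag features are omitted.

```python
import re

# Word-shape patterns from Table 1; the dictionary keys are hypothetical names.
PATTERNS = {
    "init_cap": r"[A-Z][a-z]+",
    "all_caps_2plus": r"[A-Z][A-Z]+",
    "single_cap": r"[A-Z]",
    "caps": r"[A-Z]+",
    "mixed_caps": r"[A-Z]+[a-z]+[A-Z]+[a-z]",
}


def token_features(tokens, t):
    """Active input features at position t: word identity plus word shape."""
    w = tokens[t]
    feats = {f"word={w}"}
    feats.update(name for name, pat in PATTERNS.items() if re.fullmatch(pat, w))
    return feats


def windowed_features(tokens, t, window=4):
    """Features of every position t + delta, delta in [-window, window], tagged by offset."""
    feats = set()
    for delta in range(-window, window + 1):
        if 0 <= t + delta < len(tokens):
            feats.update(f"{f}@{delta:+d}" for f in token_features(tokens, t + delta))
    return feats


tokens = ["Speaker", ":", "Robert", "Booth", "of", "CMU"]
print(sorted(token_features(tokens, 5)))
print(sorted(windowed_features(tokens, 3)))
```

Tagging each copied feature with its offset keeps, for example, "capitalized word two positions to the left" distinct from "capitalized word at the current position".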


Field      Linear-chain   Skip-chain
stime          12.6           17
etime           3.2            5.2
location        6.4            0.6
speaker        30.2            4.8

Table 3: Number of inconsistently mislabeled tokens,
that is, tokens that are mislabeled even though the
same token is labeled correctly elsewhere in the
document. Learning long-distance dependencies reduces
this kind of error in the speaker and location fields.
Numbers are averaged over 5 folds.

We evaluate a skip-chain CRF with skip edges between
identical capitalized words. The motivation for this is
that the hardest aspect of this data set is identifying
speakers and locations, and capitalized words that
occur multiple times in a seminar announcement are
likely to be either speakers or locations.

System                             stime  etime  location  speaker  overall
BIEN (Peshkin and Pfeffer, 2003)    96.0   98.8      87.1     76.9     89.7
Linear-chain CRF                    97.5   97.5      88.3     77.3     90.2
Skip-chain CRF                      96.7   97.2      88.1     80.4     90.6

Table 2: Comparison of F1 performance on the seminars
data. The top line gives a dynamic Bayes net that has
been previously used on this data set. The skip-chain
CRF beats the previous systems in overall F1 and on the
speaker field, which has proved to be the hardest field
of the four. Overall F1 is simply the average of the F1
scores for the four fields.

Table 1 shows the list of input features we used. For a
skip edge (u, v), the input features we used were
simply the disjunction of the input features at u and
v, that is,

$$q'_k(\mathbf{x}, u, v) = q_k(\mathbf{x}, u) \oplus q_k(\mathbf{x}, v) \tag{14}$$

where $\oplus$ is binary or. All of our results are
averaged over 5-fold cross-validation with an 80/20
split of the data. We report results from both a
linear-chain CRF and a skip-chain CRF with the same set
of input features.
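
Equation 14 combines the input features at the two endpoints of a skip edge with a binary OR. A minimal sketch, assuming the features at each position are stored as 0/1 lists of equal length (the function name and the example vectors are hypothetical):

```python
def skip_edge_input_features(q_u, q_v):
    """Elementwise binary OR of the 0/1 input-feature vectors at positions u and v."""
    if len(q_u) != len(q_v):
        raise ValueError("feature vectors must have the same length")
    return [a | b for a, b in zip(q_u, q_v)]


# Hypothetical feature vectors at the two endpoints of a skip edge.
q_u = [1, 0, 0, 1]
q_v = [1, 1, 0, 0]
print(skip_edge_input_features(q_u, q_v))
```

Note that a feature active at both endpoints stays active in the result: the operator is an inclusive or, not an exclusive or.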

We calculate precision and recall as²

$$P = \frac{\#\ \text{tokens extracted correctly}}{\#\ \text{tokens extracted}}$$

$$R = \frac{\#\ \text{tokens extracted correctly}}{\#\ \text{true tokens of the field}}$$

As usual, we report $F_1 = (2PR)/(P + R)$.
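
Token-level precision, recall, and F1 for one field can be computed as in the sketch below. It assumes the usual definitions (precision is correct tokens over extracted tokens, recall is correct tokens over true tokens of the field); the function name and the label sequences are made up for illustration.

```python
def token_prf(gold, pred, field):
    """Token-level precision, recall, and F1 for a single field label."""
    correct = sum(1 for g, p in zip(gold, pred) if g == field and p == field)
    extracted = sum(1 for p in pred if p == field)
    true = sum(1 for g in gold if g == field)
    precision = correct / extracted if extracted else 0.0
    recall = correct / true if true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted label sequences for one document.
gold = ["O", "speaker", "speaker", "O", "speaker"]
pred = ["O", "speaker", "O", "speaker", "speaker"]
print(token_prf(gold, pred, "speaker"))
```

Because the counts are over tokens rather than documents, every missed repeated mention lowers recall, which is the quantity skip edges are meant to improve.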

Table 2 compares a skip-chain CRF to a linear-chain CRF
and to a dynamic Bayes net used in previous work
(Peshkin and Pfeffer, 2003). The skip-chain CRF does
much better than all the other systems on speaker (80.4
F1, versus 77.3 for the linear-chain CRF and 76.9 for
BIEN), which is the label for which the skip edges
would be expected to make the most difference. On the
other fields, however, the skip-chain CRF does slightly
worse (less than 1% absolute F1).
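
The 13.7% error reduction quoted in the abstract appears to be consistent with the speaker column of Table 2, if error is taken to be 100 minus F1. That reading is an assumption; the short calculation below only checks the arithmetic.

```python
def relative_error_reduction(f1_baseline, f1_new):
    """Relative reduction in error, with error assumed to be 100 - F1 (in percent)."""
    err_base = 100.0 - f1_baseline
    err_new = 100.0 - f1_new
    return 100.0 * (err_base - err_new) / err_base


# Speaker F1 from Table 2: linear-chain 77.3, skip-chain 80.4.
print(round(relative_error_reduction(77.3, 80.4), 1))  # 13.7
```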

We expected that the skip-chain CRF would do especially
well on the speaker field, because speaker names tend
to appear multiple times in a document, and a
skip-chain CRF can learn to label the multiple
occurrences consistently. To test this hypothesis, we
measure the number of inconsistently mislabeled tokens,
that is, tokens that are mislabeled even though the
same token is classified correctly elsewhere in the
document. Table 3 compares the number of inconsistently
mislabeled tokens in the test set between linear-chain
and skip-chain CRFs. For the linear-chain CRF, on
average 30.2 true speaker tokens are inconsistently
mislabeled. Because the linear-chain CRF mislabels
121.6 true speaker tokens, this situation includes
24.7% of the missed speaker tokens. So treating
repeated tokens consistently would seem to especially
benefit recall on speaker.

² Previous work on this data set has traditionally
measured precision and recall per document, that is,
from each document the system extracts only one field
of each type. In this paper we discuss the problem in
which we wish to extract all the mentions in a
document, so we cannot compare with this previous work.
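
The count of inconsistently mislabeled tokens can be sketched as follows. This is one plausible reading of the definition, stated as an assumption: a true token of the field counts if it is mislabeled while an identical word is labeled correctly with that field elsewhere in the same document. The function name and the example are hypothetical.

```python
def inconsistently_mislabeled(tokens, gold, pred, field):
    """Count true `field` tokens that are mislabeled although the same word
    is labeled correctly as `field` somewhere else in the document."""
    correct_somewhere = {
        tok for tok, g, p in zip(tokens, gold, pred) if g == field and p == field
    }
    return sum(
        1
        for tok, g, p in zip(tokens, gold, pred)
        if g == field and p != field and tok in correct_somewhere
    )


# Hypothetical document: the second "Booth" is missed, "Smith" is never found.
tokens = ["Speaker:", "Robert", "Booth", "Booth", "talks", "Smith"]
gold = ["O", "speaker", "speaker", "speaker", "O", "speaker"]
pred = ["O", "speaker", "speaker", "O", "O", "O"]
print(inconsistently_mislabeled(tokens, gold, pred, "speaker"))
```

In the example only the second "Booth" is counted; "Smith" is mislabeled too, but it is never labeled correctly elsewhere, so a skip edge between identical words could not have helped.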

In fact, on the speaker field, skip-chain CRFs show a
dramatic decrease in inconsistently mislabeled tokens,
from 30.2 to 4.8. Because of this, the skip-chain CRF
has much better recall on speaker tokens than the
linear-chain CRF (70.0 R linear chain, 76.8 R skip
chain). This explains the increase in F1 between
linear-chain and skip-chain CRFs, for the two have
similar precision (average 86.5 P linear chain, 85.1
skip chain).

On the location field, on the other hand, where we
might also expect skip-chain CRFs to perform better,
there is no benefit. We explain this by observing in
Table 3 that inconsistent misclassification occurs much
less frequently in this field.

5  Summary 

We have demonstrated that modeling long-distance
dependencies can be used to obtain more accurate
information extraction. Because our model is
conditional, its structure is free to depend on the
input string. In this paper, we connected pairs of
identical words, on the assumption that they would tend
to have the same label.

The framework for inference and learning is similar to
Relational Markov Networks (Taskar et al., 2002) and
Dynamic CRFs (Sutton et al., 2004). Like skip-chain
CRFs, RMNs vary their graphical structure based on the
input data, but in practice RMNs have been used for
classification instead of sequence labeling. DCRFs, on
the other hand, are sequence models, but the skip-chain
CRF model is not a DCRF, because usually different
long-distance edges are used for each input string.

The skip-chain CRF can also be viewed as performing
extraction while taking into account a simple form of
coreference information, since the reason that
identical words are likely to have similar tags is that
they are likely to be coreferent. Thus, this model is a
step toward joint probabilistic models for extraction
and data mining as advocated by McCallum and Jensen
(2003). An example of such a joint model is the one of
Wellner et al. (2004), which jointly segments citations
in research papers and predicts which citations refer
to the same paper.